Gemma 3 is a lightweight multimodal model launched by Google, capable of processing text and image inputs and generating text outputs. This model has a 128K large context window and multilingual support for over 140 languages.
Image-to-Text
Transformers